 tomek link


Foundations of data imbalance and solutions for a data democracy

arXiv.org Artificial Intelligence

Dealing with imbalanced data is a prevalent problem when performing classification on datasets. This problem often contributes to bias when making decisions or implementing policies. Thus, it is vital to understand the factors that cause imbalance in the data (or class imbalance). Such hidden biases and imbalances can lead to data tyranny and pose a major challenge to a data democracy. In this chapter, two essential statistical elements are addressed: the degree of class imbalance and the complexity of the concept. Resolving such issues helps in building the foundations of a data democracy. Further, statistical measures that are appropriate in these scenarios are discussed and implemented on a real-life dataset (car insurance claims). Finally, popular data-level methods such as Random Oversampling, Random Undersampling, SMOTE, Tomek Link, and others are implemented in Python, and their performance is compared.

Keywords - Imbalanced Data, Degree of Class Imbalance, Complexity of the Concept, Statistical Assessment Metrics, Undersampling and Oversampling

1. Motivation & Introduction

In the real world, data are collected from various sources such as social networks, websites, logs, and databases. When dealing with data from different sources, it is crucial to check the quality of the data [1]. Data of questionable quality can introduce different types of bias at various stages of the data science lifecycle. These biases can sometimes affect the association between variables, and in many cases could represent the opposite of the actual behavior [2].
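Of the data-level methods named in the abstract, the Tomek link is perhaps the easiest to illustrate by hand. A Tomek link is a pair of samples from different classes that are each other's nearest neighbor. The following is a minimal sketch using only the standard library, not the chapter's own implementation; the helper names (`nearest_neighbor`, `tomek_links`, `remove_majority_in_links`) and the toy points are hypothetical, and dropping the majority-class member of each link is one common cleaning convention assumed here.

```python
from math import dist


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Return index pairs (i, j) with i < j that form Tomek links."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    links = []
    for i, j in enumerate(nn):
        # A link needs mutual nearest neighbors with different labels.
        if i < j and nn[j] == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


def remove_majority_in_links(points, labels, majority_label):
    """Drop the majority-class sample of every Tomek link (assumed convention)."""
    drop = {
        k
        for pair in tomek_links(points, labels)
        for k in pair
        if labels[k] == majority_label
    }
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical toy data: class 0 is the majority, class 1 the minority.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.4), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0)]
labels = [0, 0, 0, 0, 1, 1]

links = tomek_links(points, labels)
cleaned_points, cleaned_labels = remove_majority_in_links(points, labels, majority_label=0)
print(links)           # pairs of indices forming Tomek links
print(cleaned_labels)  # labels left after cleaning
```

The brute-force neighbor search is quadratic in the number of samples, which is fine for illustration but would likely need a spatial index on a real dataset.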


Stop using SMOTE to handle all your Imbalanced Data

#artificialintelligence

In classification tasks, one may encounter a situation where the target class labels are not equally distributed. Such a dataset is termed imbalanced data. Imbalance in the data can be an obstacle to training a data science model. In class imbalance problems, the model is trained mainly on the majority class and becomes biased towards predicting the majority class. Hence, handling class imbalance is essential before proceeding to the modeling pipeline.
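Before choosing a resampling method, it can help to quantify how unevenly the labels are distributed. The sketch below is an illustration under assumptions rather than anything taken from the article: the label names and counts are made up, and the helpers `class_distribution` and `imbalance_ratio` are hypothetical.

```python
from collections import Counter


def class_distribution(labels):
    """Share of each class among all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Made-up target column: 950 majority samples, 50 minority samples.
labels = ["majority"] * 950 + ["minority"] * 50

shares = class_distribution(labels)
ratio = imbalance_ratio(labels)
print(shares)
print(ratio)
```

A ratio close to 1 suggests a roughly balanced target; the larger the ratio, the more a model trained on the raw data is likely to favor the majority class.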


Undersampling Algorithms for Imbalanced Classification

#artificialintelligence

Taken from "Improving Identification of Difficult Small Classes by Balancing Class Distribution". This technique can be implemented using the NeighbourhoodCleaningRule class from imbalanced-learn. The number of neighbors used in the ENN and CNN steps is set via the n_neighbors argument, which defaults to three. The threshold_cleaning argument controls whether or not the CNN is applied to a given class, which might be useful if there are multiple minority classes with similar sizes. Here it is kept at 0.5.


Imbalanced Datasets

@machinelearnbot

Imagine you are a medical professional who is training a classifier to detect whether an individual has an extremely rare disease. You train your classifier, and it yields 99.9% accuracy on your test set. You're overcome with joy by these results, but when you check the labels output by the classifier, you see it always output "No Disease," regardless of the patient data. Because the disease is extremely rare, there were only a handful of patients with the disease in your dataset compared to the thousands of patients without the disease. Because over 99.9% of the patients in your dataset don't have the disease, any classifier can achieve an impressively high accuracy simply by returning "No Disease" for every new patient.
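The scenario can be reproduced with a few lines of arithmetic. The sketch below uses made-up numbers (5 patients with the disease out of 10,000) and hypothetical helper functions; it shows how a classifier that always answers "No Disease" scores very high accuracy while its recall on the disease class is zero.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of truly positive samples that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Made-up test set: 1 = disease, 0 = no disease.
y_true = [1] * 5 + [0] * 9995

# A "classifier" that returns "No Disease" for every patient.
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred, positive=1)
print(f"accuracy = {acc:.4f}")
print(f"recall on the disease class = {rec:.4f}")
```

Reporting recall (or a similar per-class metric) alongside accuracy makes this failure visible.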